TODO:

- Merge the two plots and use ggplot2 to plot the combined plot. Use plotThreshVsPerf in mlr to create plots.
- Make it so we can calculate various types of metrics: accuracy, PPV, NPV, sensitivity, specificity, etc.
- Compare PARE values against AUC (and similar) values to see how strongly they are correlated, and when/if they are not. Compare against the F-score?
- Multiclass ROC (how to deal with multiple predictions per sample?). For multiclass, can you just permute each column in the probability matrix to get the baseline score?
- Test extremely large, large, medium, small, and extremely small sample sizes.
- Use ggplot2 for the plotROC function below.
- Address the question people ask: "At what level of accuracy do you get excited?" Use bootstrap subsampling to calculate confidence intervals as a way to address this. We should see wider intervals for smaller data sets.
- Find the optimal cut point and report it. Report it as the maximum difference from the permuted baseline?
- Show how the score increases as the amount of signal increases. Monotonically? Compare it with what you see for other metrics.
- Smooth the permuted accuracy values.
- Emphasize how you can use this method to quantify whether performance is better than clinical/baseline predictors.
- Can you apply it to regression problems?
numRandomValues = 10000
imbalanceProportion = 0.99
set.seed(0)

# Balanced two-class labels (half 0, half 1)
actual = factor(c(rep(0, numRandomValues * 0.5), rep(1, numRandomValues * 0.5)))

# No signal: scores are independent of the labels (rescaled to [0, 1])
randomScores = rnorm(numRandomValues)
randomScores = (randomScores - min(randomScores)) / (max(randomScores) - min(randomScores))

# Medium signal: scores for the positive class are shifted by 2 standard deviations
mediumSignalScores = c(rnorm(numRandomValues * 0.5), rnorm(numRandomValues * 0.5, mean=2))
mediumSignalScores = (mediumSignalScores - min(mediumSignalScores)) / (max(mediumSignalScores) - min(mediumSignalScores))

# Strong signal: scores for the positive class are shifted by 10 standard deviations
signalScores = c(rnorm(numRandomValues * 0.5), rnorm(numRandomValues * 0.5, mean=10))
signalScores = (signalScores - min(signalScores)) / (max(signalScores) - min(signalScores))

# Strong signal in the opposite direction
oppositeScores = c(rnorm(numRandomValues * 0.5, mean=10), rnorm(numRandomValues * 0.5))
oppositeScores = (oppositeScores - min(oppositeScores)) / (max(oppositeScores) - min(oppositeScores))

# Discrete predictions that are always right, always wrong, or constant
signalBinaryScores = as.integer(as.logical(as.numeric(as.vector(actual))))
oppositeBinaryScores = as.integer(!as.logical(as.numeric(as.vector(actual))))
singleValueScores = rep(as.integer(levels(actual)[2]), length(actual))

# Imbalanced labels (99% class 0) and corresponding score vectors
actualImbalanced = factor(c(rep(0, numRandomValues * imbalanceProportion), rep(1, numRandomValues * (1 - imbalanceProportion))))

mediumSignalScoresImbalanced = c(rnorm(numRandomValues * imbalanceProportion), rnorm(numRandomValues * (1 - imbalanceProportion), mean=2))
mediumSignalScoresImbalanced = (mediumSignalScoresImbalanced - min(mediumSignalScoresImbalanced)) / (max(mediumSignalScoresImbalanced) - min(mediumSignalScoresImbalanced))

signalScoresImbalanced = c(rnorm(numRandomValues * imbalanceProportion), rnorm(numRandomValues * (1 - imbalanceProportion), mean=10))
signalScoresImbalanced = (signalScoresImbalanced - min(signalScoresImbalanced)) / (max(signalScoresImbalanced) - min(signalScoresImbalanced))

oppositeScoresImbalanced = c(rnorm(numRandomValues * imbalanceProportion, mean=10), rnorm(numRandomValues * (1 - imbalanceProportion)))
oppositeScoresImbalanced = (oppositeScoresImbalanced - min(oppositeScoresImbalanced)) / (max(oppositeScoresImbalanced) - min(oppositeScoresImbalanced))

signalBinaryScoresImbalanced = as.integer(as.logical(as.numeric(as.vector(actualImbalanced))))
oppositeBinaryScoresImbalanced = as.integer(!as.logical(as.numeric(as.vector(actualImbalanced))))
hist(randomScores, breaks=50, main="Random Scores")
hist(mediumSignalScores, breaks=50, main="Medium Signal Scores")
hist(signalScores, breaks=50, main="Signal Scores")
hist(oppositeScores, breaks=50, main="Opposite Scores")
hist(signalBinaryScores, breaks=50, main="Signal Binary Scores")
hist(oppositeBinaryScores, breaks=50, main="Opposite Binary Scores")
boxplot(randomScores~actual, main="Random Scores")
boxplot(mediumSignalScores~actual, main="Medium Signal Scores")
boxplot(signalScores~actual, main="Signal Scores")
boxplot(oppositeScores~actual, main="Opposite Scores")
boxplot(signalBinaryScores~actual, main="Signal Binary Scores")
boxplot(oppositeBinaryScores~actual, main="Opposite Binary Scores")
hist(mediumSignalScoresImbalanced, breaks=50, main="Medium Signal Scores Imbalanced")
hist(signalScoresImbalanced, breaks=50, main="Signal Scores Imbalanced")
hist(oppositeScoresImbalanced, breaks=50, main="Opposite Scores Imbalanced")
hist(signalBinaryScoresImbalanced, breaks=50, main="Signal Binary Scores Imbalanced")
hist(oppositeBinaryScoresImbalanced, breaks=50, main="Opposite Binary Scores Imbalanced")
boxplot(mediumSignalScoresImbalanced~actualImbalanced, main="Medium Signal Scores Imbalanced")
boxplot(signalScoresImbalanced~actualImbalanced, main="Signal Scores Imbalanced")
boxplot(oppositeScoresImbalanced~actualImbalanced, main="Opposite Scores Imbalanced")
boxplot(signalBinaryScoresImbalanced~actualImbalanced, main="Signal Binary Scores Imbalanced")
boxplot(oppositeBinaryScoresImbalanced~actualImbalanced, main="Opposite Binary Scores Imbalanced")
library(AUC)
## AUC 0.3.0
## Type AUCNews() to see the change log and ?AUC to get an overview.
plotROC = function(actual, scores, main)
{
  rocResult = roc(scores, actual)
  par(mar=c(4.1, 4.6, 2.1, 0.6))
  plot(rocResult$fpr, rocResult$tpr, type="l", main=main, xlab="False positive rate", ylab="True positive rate")
  abline(0, 1, lty=2, col=2)
}
plotROC(actual, randomScores, "Random Scores")
plotROC(actual, mediumSignalScores, main="Medium Signal Scores")
plotROC(actual, signalScores, main="Signal Scores")
plotROC(actual, oppositeScores, main="Opposite Scores")
plotROC(actual, signalBinaryScores, main="Signal Binary Scores")
plotROC(actual, singleValueScores, main="Single Value Scores")
plotROC(actual, oppositeBinaryScores, main="Opposite Binary Scores")
plotROC(actualImbalanced, mediumSignalScoresImbalanced, main="Medium Signal Scores Imbalanced")
plotROC(actualImbalanced, signalScoresImbalanced, main="Signal Scores Imbalanced")
plotROC(actualImbalanced, oppositeScoresImbalanced, main="Opposite Scores Imbalanced")
plotROC(actualImbalanced, signalBinaryScoresImbalanced, main="Signal Binary Scores Imbalanced")
plotROC(actualImbalanced, oppositeBinaryScoresImbalanced, main="Opposite Binary Scores Imbalanced")
pare <- function(response, predictor, main="", metricLabel="Accuracy")
{
  metricLabel = paste("PARE Score (", metricLabel, ")", sep="")

  # Sort the samples by response so the "ideal" predictor (sorted scores) aligns with it
  combined = cbind(as.vector(response), predictor)
  combined = apply(combined, 2, as.numeric)
  combined = combined[order(response),]
  response = combined[,1]
  predictor = combined[,2]

  # Sweep thresholds evenly across the observed range of the predictor
  thresholdStepSize = (max(predictor) - min(predictor)) / (length(predictor) + 1)
  thresholds = seq(min(predictor), max(predictor), thresholdStepSize)

  # Majority-class accuracy provides a constant baseline
  baselineAccuracyAtThresholds = rep(max(table(response)) / length(response), length(thresholds))

  # Accuracy expected by chance: permute the response labels at each threshold
  permutedAccuracyAtThresholds = NULL
  set.seed(0)

  for (threshold in thresholds)
  {
    predictorDiscretized = as.integer(predictor > threshold)
    permutedResponse = sample(response)
    permutedAccuracy = sum(predictorDiscretized == permutedResponse) / length(response)
    permutedAccuracyAtThresholds = c(permutedAccuracyAtThresholds, permutedAccuracy)
  }

  accuracyAtThresholds = calculateAccuracyAtThresholds(response, predictor, thresholds)

  # Best case: scores perfectly ordered with the (sorted) response
  idealPredictor = sort(predictor)
  idealAccuracyAtThresholds = calculateAccuracyAtThresholds(response, idealPredictor, thresholds)

  # Worst case: scores ordered opposite to the response
  wrongPredictor = sort(predictor, decreasing=TRUE)
  wrongAccuracyAtThresholds = calculateAccuracyAtThresholds(response, wrongPredictor, thresholds)

  # PARE score: the mean improvement over the permuted baseline, scaled by the
  # improvement an ideal predictor would achieve (or, when the score is negative,
  # by the deficit of a perfectly wrong predictor)
  baselineDiff = mean(idealAccuracyAtThresholds - permutedAccuracyAtThresholds)
  actualDiff = mean(accuracyAtThresholds - permutedAccuracyAtThresholds)

  pareScore = 0
  if (baselineDiff != 0)
    pareScore = actualDiff / baselineDiff

  if (pareScore < 0)
  {
    baselineDiff = mean(permutedAccuracyAtThresholds - wrongAccuracyAtThresholds)
    pareScore = actualDiff / baselineDiff
  }

  main = paste("PARE - ", main, "\n", "(Score = ", format(pareScore, digits=3, nsmall=3), ")", sep="")

  plot(thresholds, accuracyAtThresholds, type="l", ylim=c(0, 1), main=main, xlab="Threshold", ylab=metricLabel)
  lines(thresholds, idealAccuracyAtThresholds, col=2)
  lines(thresholds, permutedAccuracyAtThresholds, col=3)
  legend("topleft", lwd=2, legend=c("Actual", "Ideal", "Permuted"), col=1:3, cex=0.8)

  actualDiffPermuted = accuracyAtThresholds - permutedAccuracyAtThresholds

  # Not sure whether we want this graph (or maybe it's just a supplement) because it doesn't show actual accuracy scores.
  plot(thresholds, actualDiffPermuted, type="l", xlim=range(thresholds), ylim=range(actualDiffPermuted), main=main, xlab="Threshold", ylab=metricLabel)
  abline(v=thresholds[median(which(actualDiffPermuted == max(actualDiffPermuted)))], col=2, lty=2, lwd=2)
  abline(v=range(thresholds), col="gray", lty=2, lwd=1)
  abline(h=c(-1, 1), col="gray", lty=2, lwd=1)
}
# Accuracy of the thresholded predictor against the response at each threshold
calculateAccuracyAtThresholds <- function(response, predictor, thresholds)
{
  accuracyAtThresholds = NULL

  for (threshold in thresholds)
  {
    predictorDiscretized = as.integer(predictor > threshold)
    accuracy = sum(predictorDiscretized == response) / length(response)
    accuracyAtThresholds = c(accuracyAtThresholds, accuracy)
  }

  return(accuracyAtThresholds)
}
pare(actual, randomScores, main="Random Scores")
pare(actual, mediumSignalScores, main="Medium Signal Scores")
pare(actual, signalScores, main="Signal Scores")
pare(actual, oppositeScores, main="Opposite Scores")
pare(actual, signalBinaryScores, main="Signal Binary Scores")
pare(actual, oppositeBinaryScores, main="Opposite Binary Scores")
pare(actual, singleValueScores, main="Single Value Scores")
pare(actualImbalanced, randomScores, main="Random Scores Imbalanced")
pare(actualImbalanced, mediumSignalScoresImbalanced, main="Medium Signal Scores Imbalanced")
pare(actualImbalanced, signalScoresImbalanced, main="Signal Scores Imbalanced")
pare(actualImbalanced, oppositeScoresImbalanced, main="Opposite Scores Imbalanced")
pare(actualImbalanced, signalBinaryScoresImbalanced, main="Signal Binary Scores Imbalanced")
pare(actualImbalanced, oppositeBinaryScoresImbalanced, main="Opposite Binary Scores Imbalanced")
PARE Plot = Performance Above Random Expectation. A key contribution would be that it generalizes beyond two classes.
ROC analysis has its roots in signal processing, which makes it more difficult for people to understand.
Here’s a list of classification measures: https://mlr-org.github.io/mlr-tutorial/tutorial/release/html/measures/index.html
One non-intuitive thing about ROC plots is that they vary the threshold along the curve but do not show that threshold explicitly on the plot. It’s also not intuitive that an AUC value of 0.5 is what you expect by chance. The AUC is also difficult to calculate by hand, and it doesn’t work for multiclass problems. In addition, sensitivity and specificity are difficult for many people to understand.
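On the "difficult to calculate" point: the AUC can at least be computed without constructing the curve, via its well-known equivalence to the Mann–Whitney U statistic. A minimal sketch (the function name is a placeholder; labels are assumed to be 0/1):

```r
# AUC via the rank-sum (Mann-Whitney U) identity: the probability that a
# randomly chosen positive sample scores higher than a randomly chosen negative one.
aucFromRanks <- function(labels, scores)
{
  ranks = rank(scores)  # average ranks handle ties
  nPos = sum(labels == 1)
  nNeg = sum(labels == 0)
  (sum(ranks[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

aucFromRanks(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))  # perfectly separated scores give 1
```

This would be a simple way to sanity-check whatever AUC implementation we use in the comparisons.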
On the positive side, the AUC is robust to class imbalance, widely used, and relatively easy to interpret because it can be related loosely back to accuracy. It also doesn’t force you to choose one specific threshold.
Come up with 1) an alternative metric (or metrics) that can be used instead and interpreted similarly to accuracy and 2) a way of visualizing it. Perhaps the x axis shows the range of thresholds. The y axis could be sensitivity and specificity (two separate lines), the difference between the two, or something along those lines. Or it could simply be accuracy (a good start). Or perhaps support different metrics (potentially including positive predictive value and negative predictive value, or even the log-rank statistic for survival analyses).
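The two-separate-lines idea could be sketched like this (the function name is a placeholder; labels are assumed to be 0/1, with scores above the threshold called positive):

```r
# Sensitivity and specificity at each threshold, plotted on a shared threshold axis.
sensSpecAtThresholds <- function(labels, scores, thresholds)
{
  sens = sapply(thresholds, function(t) sum(scores > t & labels == 1) / sum(labels == 1))
  spec = sapply(thresholds, function(t) sum(scores <= t & labels == 0) / sum(labels == 0))
  list(sensitivity=sens, specificity=spec)
}

# Toy example with well-separated scores
labels = c(0, 0, 0, 1, 1, 1)
scores = c(0.1, 0.3, 0.4, 0.6, 0.7, 0.9)
thresholds = seq(0, 1, 0.01)
result = sensSpecAtThresholds(labels, scores, thresholds)

plot(thresholds, result$sensitivity, type="l", ylim=c(0, 1), xlab="Threshold", ylab="Metric")
lines(thresholds, result$specificity, col=2)
legend("bottomright", lwd=2, legend=c("Sensitivity", "Specificity"), col=1:2)
```

Unlike an ROC curve, this keeps the threshold explicit on the x axis, and the crossing point of the two lines is one candidate for an "optimal" threshold.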
What this could also show you is which is the optimal threshold to use in clinical settings.
To get this into a top journal (Nature Methods), you could write this as a tutorial/review article and add your own metric at the end.
These curves seem to have two main purposes: 1) assessing the balance between sensitivity and specificity and 2) comparing classifiers. Need to address both of these needs.
Develop a tool in Python (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html) and/or in R (http://stackoverflow.com/questions/4954507/calculate-the-area-under-a-curve-in-r). The inputs would be a vector of actual labels and a vector of predictions (continuous predictions are the best option, but discrete ones could also be supported).
Within an IPython notebook, simulate various normal and extreme scenarios and illustrate what the metric produces in those scenarios. Also compare it to what other metrics produce (possible alternatives here: http://en.wikipedia.org/wiki/Receiver_operating_characteristic). Fine-tune it accordingly. Then show some examples using real data.
Can this be applied to multiclass data? See info here: http://en.wikipedia.org/wiki/Receiver_operating_characteristic.
Look at the paper entitled “Union–intersection permutation solution for two-sample equivalence testing.”
Summary of performance measures in machine learning: http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
Only accuracy generalizes beyond two classes…
Nicolas Robine (@notSoJunkDNA), 25 Mar 2015: “KP: Side point: Harmonic mean of precision and recall (Fmax statistics) is better than AUC in many genomics application… #5points” https://twitter.com/notSoJunkDNA/status/580732327149596672
Harmonic mean of precision/recall is described in http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
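The Fmax idea mentioned in the tweet can be computed with a single sweep over candidate thresholds. A minimal sketch (the function name is a placeholder; labels are assumed to be 0/1):

```r
# Fmax: the maximum F1 score (harmonic mean of precision and recall)
# over all thresholds, treating scores at or above the threshold as positive.
fmax <- function(labels, scores)
{
  best = 0

  for (threshold in sort(unique(scores)))
  {
    predicted = as.integer(scores >= threshold)
    tp = sum(predicted == 1 & labels == 1)
    fp = sum(predicted == 1 & labels == 0)
    fn = sum(predicted == 0 & labels == 1)

    if (tp > 0)
    {
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)
      best = max(best, 2 * precision * recall / (precision + recall))
    }
  }

  best
}

fmax(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))  # perfectly separated scores give 1
```

This would make it straightforward to include Fmax alongside AUC in the PARE comparisons above.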
The ROCR package (https://rocr.bioinf.mpi-sb.mpg.de/) has an implementation of an accuracy-versus-threshold plot, plus many other metrics.
The precision-recall curve is an alternative. But it’s not really better. This presentation provides some theoretical background: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0708/ml-sem/Folien/Wen_Zhang.pdf
Here’s another alternative along with some more theory: http://link.springer.com/article/10.1007%2Fs10994-009-5119-5
R package described here: http://www.hmeasure.net. It claims that the AUC is equivalent to the “Gini coefficient” (Gini = 2 × AUC − 1).
“For example, if ROC curves cross then it is possible that one curve has a larger AUC (and so is apparently better) even though the alternative may show superior performance over almost the entire range of values of the classification threshold (defined below) for which the curve will be used.”
“a much more fundamental weakness of the AUC which appears not to have been previously recognised. This is that, as we show below, the AUC is equivalent to measuring the performance of classification rules using metrics which depend on the rules being measured. In particular, instead of regarding the relative severities of different kinds of misclassifications (i.e., misclassifying a class 0 object as class 1, and a class 1 as class 0) as the same for different classifiers, the AUC takes these relative severities themselves to depend on the classifier being measured. This is, of course, nonsensical, since the relative severities, even if one does not know them precisely, are independent of the classifiers themselves and must be determined by factors describing the problem external to the classifiers. It is as if one chose to compare the heights of two people using rulers in which the basic units of measurement themselves depended on the heights.”
Here’s another one: http://users.dsic.upv.es/~flip/ROCAI2004/papers/03-vfinal-Drummond_and_Holte-A4.pdf. They emphasize confidence intervals, which are important and should be included in our tool (bootstrap approach?).
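The bootstrap idea could look something like this: a percentile interval for accuracy at a fixed threshold (the function name and defaults are placeholders; labels are assumed to be numeric 0/1):

```r
# Percentile bootstrap confidence interval for accuracy at a fixed threshold.
# Smaller samples should produce visibly wider intervals.
bootstrapAccuracyCI <- function(labels, scores, threshold, numBootstrap=1000, alpha=0.05)
{
  set.seed(0)

  accuracies = replicate(numBootstrap, {
    indices = sample(length(labels), replace=TRUE)
    mean(as.integer(scores[indices] > threshold) == labels[indices])
  })

  quantile(accuracies, c(alpha / 2, 1 - alpha / 2))
}
```

Applying this at each threshold in the PARE sweep would give a confidence band around the accuracy curve rather than a single interval, which speaks directly to the "at what level of accuracy do you get excited?" question.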
Another: http://www.nssl.noaa.gov/users/brooks/public_html/feda/papers/Drummond%20and%20Holte.pdf
http://www.igi-global.com/article/alternative-approach-evaluating-binormal-roc/80236
http://www.jstor.org/discover/10.2307/3702911?sid=21105221364251&uid=2&uid=4&uid=3739256&uid=3739808
Use the comparison methodology described here?: http://webdocs.cs.ualberta.ca/~ozlem/papers/TR2009.pdf
Another possibility is to come up with an alternative to survival analysis: http://errorstatistics.com/2013/04/19/stephen-senn-when-relevance-is-irrelevant/